Unsupervised profiling of OCRed historical documents

Internal identifier: 000181 (Main/Exploration); previous: 000180; next: 000182

Authors: Ulrich Reffle [Germany]; Christoph Ringlstetter [Germany]

Source:

Pattern recognition [ISSN 0031-3203]; 2013.

RBID : Pascal:13-0098799

French descriptors

Pascal (Inist)
- Classification non supervisée, Reconnaissance optique caractère, Moteur recherche, Bibliothèque électronique, Système 2 canaux, Vocabulaire, Disponibilité, Recherche information, Reconnaissance forme, Détection erreur, Correction erreur, Apprentissage, Classification signal.

English descriptors

KwdEn:
- Availability, Electronic library, Error correction, Error detection, Information retrieval, Learning, Optical character recognition, Pattern recognition, Search engine, Signal classification, Two channel system, Unsupervised classification, Vocabulary.

Abstract

In search engines and digital libraries, more and more OCRed historical documents become available. Still, access to these texts is often not satisfactory due to two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation represents a barrier for search even if texts are properly reconstructed. As one step towards a solution we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) "global" information on typical recognition errors found in the OCR output, typical patterns for historical spelling variation, vocabulary and word frequencies in the underlying text, and (2) "local" hypotheses on OCR-errors and historical orthography of particular tokens of the OCR output. We argue that availability of this kind of knowledge represents a key step for improving OCR and Information Retrieval (IR) on historical texts: profiles can be used, e.g., to automatically finetune postcorrection systems or adapt OCR engines to the given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation results show a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and estimated ranks and scores automatically computed in profiles. As a specific application we show how to improve the output of a commercial OCR engine using profiles in a postcorrection system.

Affiliations:

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000060
to stream PascalFrancis, to step Curation: 000708
to stream PascalFrancis, to step Checkpoint: 000024
to stream Main, to step Merge: 000184
to stream Main, to step Curation: 000181

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Unsupervised profiling of OCRed historical documents</title>
<author><name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>University of Munich, Center of Information and Language Processing</s1>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<wicri:noRegion>Center of Information and Language Processing</wicri:noRegion>
<wicri:noRegion>University of Munich, Center of Information and Language Processing</wicri:noRegion>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
<placeName><settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>University of Munich, Center of Information and Language Processing</s1>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<wicri:noRegion>Center of Information and Language Processing</wicri:noRegion>
<wicri:noRegion>University of Munich, Center of Information and Language Processing</wicri:noRegion>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
<placeName><settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">13-0098799</idno>
<date when="2013">2013</date>
<idno type="stanalyst">PASCAL 13-0098799 INIST</idno>
<idno type="RBID">Pascal:13-0098799</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000060</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000708</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000024</idno>
<idno type="wicri:doubleKey">0031-3203:2013:Reffle U:unsupervised:profiling:of</idno>
<idno type="wicri:Area/Main/Merge">000184</idno>
<idno type="wicri:Area/Main/Curation">000181</idno>
<idno type="wicri:Area/Main/Exploration">000181</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Unsupervised profiling of OCRed historical documents</title>
<author><name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>University of Munich, Center of Information and Language Processing</s1>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<wicri:noRegion>Center of Information and Language Processing</wicri:noRegion>
<wicri:noRegion>University of Munich, Center of Information and Language Processing</wicri:noRegion>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
<placeName><settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>University of Munich, Center of Information and Language Processing</s1>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<wicri:noRegion>Center of Information and Language Processing</wicri:noRegion>
<wicri:noRegion>University of Munich, Center of Information and Language Processing</wicri:noRegion>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
<placeName><settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Pattern recognition</title>
<title level="j" type="abbreviated">Pattern recogn.</title>
<idno type="ISSN">0031-3203</idno>
<imprint><date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Pattern recognition</title>
<title level="j" type="abbreviated">Pattern recogn.</title>
<idno type="ISSN">0031-3203</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Availability</term>
<term>Electronic library</term>
<term>Error correction</term>
<term>Error detection</term>
<term>Information retrieval</term>
<term>Learning</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Search engine</term>
<term>Signal classification</term>
<term>Two channel system</term>
<term>Unsupervised classification</term>
<term>Vocabulary</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Classification non supervisée</term>
<term>Reconnaissance optique caractère</term>
<term>Moteur recherche</term>
<term>Bibliothèque électronique</term>
<term>Système 2 canaux</term>
<term>Vocabulaire</term>
<term>Disponibilité</term>
<term>Recherche information</term>
<term>Reconnaissance forme</term>
<term>Détection erreur</term>
<term>Correction erreur</term>
<term>Apprentissage</term>
<term>Classification signal</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In search engines and digital libraries, more and more OCRed historical documents become available. Still, access to these texts is often not satisfactory due to two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation represents a barrier for search even if texts are properly reconstructed. As one step towards a solution we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) "global" information on typical recognition errors found in the OCR output, typical patterns for historical spelling variation, vocabulary and word frequencies in the underlying text, and (2) "local" hypotheses on OCR-errors and historical orthography of particular tokens of the OCR output. We argue that availability of this kind of knowledge represents a key step for improving OCR and Information Retrieval (IR) on historical texts: profiles can be used, e.g., to automatically finetune postcorrection systems or adapt OCR engines to the given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation results show a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and estimated ranks and scores automatically computed in profiles. As a specific application we show how to improve the output of a commercial OCR engine using profiles in a postcorrection system.</div>
</front>
</TEI>
<affiliations><list><country><li>Allemagne</li>
</country>
<region><li>Bavière</li>
<li>District de Haute-Bavière</li>
</region>
<settlement><li>Munich</li>
</settlement>
<orgName><li>Université Louis-et-Maximilien de Munich</li>
</orgName>
</list>
<tree><country name="Allemagne"><region name="Bavière"><name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
</region>
<name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000181 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000181 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:13-0098799
   |texte=   Unsupervised profiling of OCRed historical documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.